Recommending Movies

The MovieLens 20M dataset contains 20 million user ratings from 1 to 5 of thousands of movies. In this demo we'll build a simple recommendation system which will use this data to suggest 25 movies based on a seed movie you provide.

The notebook cells below use pymldb's Connection class to make REST API calls. You can check out the Using pymldb Tutorial for more details.


In [1]:
from pymldb import Connection
mldb = Connection()

Download the MovieLens 20M data

We'll start by using some command-line tools to download and decompress the data.


In [2]:
%%bash
mkdir -p /mldb_data/data
curl "http://public.mldb.ai/ml-20m.zip" 2>/dev/null  > /mldb_data/data/ml-20m.zip
unzip /mldb_data/data/ml-20m.zip -d /mldb_data/data


Archive:  /mldb_data/data/ml-20m.zip
replace /mldb_data/data/ml-20m/links.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename:  NULL
(EOF or read error, treating as "[N]one" ...)

In [3]:
%%bash
head /mldb_data/data/ml-20m/README.txt


Summary
=======

This dataset (ml-20m) describes 5-star rating and free-text tagging activity from [MovieLens](http://movielens.org), a movie recommendation service. It contains 20000263 ratings and 465564 tag applications across 27278 movies. These data were created by 138493 users between January 09, 1995 and March 31, 2015. This dataset was generated on March 31, 2015.

Users were selected at random for inclusion. All selected users had rated at least 20 movies. No demographic information is included. Each user is represented by an id, and no other information is provided.

The data are contained in four files, `links.csv`, `movies.csv`, `ratings.csv` and `tags.csv`. More details about the contents and use of all these files follows.

This and other GroupLens data sets are publicly available for download at <http://grouplens.org/datasets/>.

In [4]:
%%bash
head /mldb_data/data/ml-20m/ratings.csv


userId,movieId,rating,timestamp
1,2,3.5,1112486027
1,29,3.5,1112484676
1,32,3.5,1112484819
1,47,3.5,1112484727
1,50,3.5,1112484580
1,112,3.5,1094785740
1,151,4.0,1094785734
1,223,4.0,1112485573
1,253,4.0,1112484940

Load the data into MLDB

See the Loading Data Tutorial guide for more details on how to get data into MLDB.

Here we load a text file and use the pivot aggregator to create a sparse matrix representation of the ratings.


In [5]:
%%time

print mldb.put('/v1/procedures/import_mvlns', {
    "type": "import.text", 
    "params": {
        "dataFileUrl":"file:///mldb_data/data/ml-20m/ratings.csv",
        "outputDataset": "mvlns_ratings_csv",
        "runOnCreation": True
    }
})

print mldb.put('/v1/procedures/process_mvlns', {
    "type": "transform",
    "params": {
        "inputData": """
            select pivot(movieId, rating) as *
            named userId 
            from mvlns_ratings_csv
            group by userId
            """,
        "outputDataset": "mvlns_ratings",
        "runOnCreation": True
    }
})


<Response [201]>
<Response [201]>
CPU times: user 18.4 ms, sys: 4.22 ms, total: 22.7 ms
Wall time: 1min 19s

Take a peek at the dataset

We'll use the Query API. Each row is a user, each column is a movie, and the cell value is the rating the user gave the movie.


In [6]:
mldb.query("select * from mvlns_ratings limit 3")


Out[6]:
21 58 69 163 246 357 441 471 858 908 ... 3255 3260 3386 3418 3425 3654 3713 3865 3879 3897
_rowName
138492 3 4.5 4.5 3.5 4 4.5 5 4.5 5 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
138491 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
138490 3 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 4 4 3 4 2 4 4 1 3 4

3 rows × 250 columns

Singular Value Decomposition (SVD)

We will create and run a Procedure of type svd.train. This creates an embedding dataset where each row is a movie and the columns represent coordinates in a 100-dimensional space. Similar movies end up closer to each other than dissimilar movies.


In [7]:
print mldb.put('/v1/procedures/mvlns_svd', {
    "type" : "svd.train",
    "params" : {
        "trainingData" : "select COLUMN EXPR (where rowCount() > 3) from mvlns_ratings",
        "columnOutputDataset" : "mvlns_svd_embedding",
        "modelFileUrl": "file://models/mvlns.svd",
        "functionName": "mvlns_svd_embedder",
        "runOnCreation": True
    }
})


<Response [201]>

Explore the results!

Our dataset has movieIds but humans think about movie names so we'll load up the movie names in a dataset.


In [8]:
from ipywidgets import interact, interact_manual
from uuid import uuid4

print mldb.put('/v1/procedures/import_movies', {
    "type": "import.text", 
    "params": {
        "dataFileUrl":"file:///mldb_data/data/ml-20m/movies.csv",
        "outputDataset": "movies",
        "select": "title, movieId",
        "named": "movieId",
        "runOnCreation": True
    }
})


<Response [201]>

A simple search function to find all movies (and corresponding movieIds) whose names contain a string.


In [9]:
@interact
def movie_search(x = "toy story"):
    return mldb.query("select title from movies where regex_match(lower(title), '.*%s.*')" % x.strip().lower())


title
_rowName
115875 Toy Story Toons: Hawaiian Vacation (2011)
3114 Toy Story 2 (1999)
120474 Toy Story That Time Forgot (2014)
106022 Toy Story of Terror (2013)
78499 Toy Story 3 (2010)
1 Toy Story (1995)
120468 Toy Story Toons: Partysaurus Rex (2012)
115879 Toy Story Toons: Small Fry (2011)

Now let's create a dataset to hold user preferences, and a simple function to simulate a user rating movies they like and movies they dislike, based on the movie_search function above.


In [10]:
print mldb.put("/v1/datasets/mvlns_user_prefs", {"type": "sparse.mutable"})

print mldb.put("/v1/functions/preferences", {
    "type": "sql.query",
    "params": {
        "query": "select {*} as p from mvlns_user_prefs where rowName()=$user"
    }
})

def save_prefs(user_id, likes, dislikes):
    for rating, search_terms in zip([5,1],[likes, dislikes]):
        for x in search_terms.split(","):
            if len(x) > 3:
                mldb.post("/v1/datasets/mvlns_user_prefs/rows", {
                     "rowName":user_id, 
                     "columns": [[str(m), rating, 0] for m in movie_search(x).index]
                })
    mldb.post("/v1/datasets/mvlns_user_prefs/commit", {})

save_prefs("janedoe", "Toy Story", "Terminator")
mldb.query("select preferences({ user: 'janedoe' })[p] as *")


<Response [201]>
<Response [201]>
Out[10]:
1 589 1240 3114 4934 6537 68791 78499 102425 106022 115875 115879 120468 120474
_rowName
result 5 1 1 5 1 1 1 5 1 5 5 5 5 5

With all that done, we can now build a recommendation engine out of a simple SQL query by mapping a user's preferences into the same space as the movie embeddings (i.e. embedding the user's preferences) and looking for the nearest movies.


In [11]:
print mldb.put("/v1/functions/nearest_movies", {
    "type": "embedding.neighbors",
    "params": {
        "dataset": "mvlns_svd_embedding",
        "defaultNumNeighbors": 25,
        "columnName": "embedding"
    }
})

print mldb.put("/v1/functions/recommendations", {
    "type": "sql.query",
    "params": {
        "query": """
            select nearest_movies({ 
                    coords: mvlns_svd_embedder({
                        row: preferences({ user: $user })[p]
                    })[embedding] 
                })[distances] as r
        """
    }
})


<Response [201]>
<Response [201]>

Here's a simple function which lets you simulate the results of liking and disliking certain movies and getting back the resulting recommendations.


In [12]:
def recommend(likes="Toy Story, Terminator", dislikes="Star Trek"):
    
    # here we simulate a new user saving these preferences
    user_id = str(uuid4())
    save_prefs(user_id, likes, dislikes)
    
    # we can then run an SQL query to:
    #   - retrieve recommendations
    #   - transpose and join them to movies to get titles
    #   - exclude the already-rated movies from the result
    return mldb.query("""
        select m.title 
        named m.movieId
        from 
            transpose(( select recommendations({ user: '%(user)s' }) )) as r 
            join movies as m on r.rowPathElement(2) = m.rowPathElement(0)
        where m.movieId not in (keys of preferences({ user: '%(user)s' })[p])
        order by r.result
        """ % dict(user=user_id))

recommend(likes="Toy Story, Terminator", dislikes="Star Trek")


Out[12]:
m.title
_rowName
316 Stargate (1994)
1682 Truman Show, The (1998)
2617 Mummy, The (1999)
1222 Full Metal Jacket (1987)
2355 Bug's Life, A (1998)
6365 Matrix Reloaded, The (2003)
3527 Predator (1987)
2329 American History X (1998)
292 Outbreak (1995)
145 Bad Boys (1995)
4886 Monsters, Inc. (2001)
2000 Lethal Weapon (1987)
1784 As Good as It Gets (1997)
434 Cliffhanger (1993)
1961 Rain Man (1988)
349 Clear and Present Danger (1994)
2115 Indiana Jones and the Temple of Doom (1984)
5445 Minority Report (2002)
594 Snow White and the Seven Dwarfs (1937)
1641 Full Monty, The (1997)
2003 Gremlins (1984)

Here's an interactive form that lets you play with this function to see if you agree with the recommendations!

NOTE: the interactive part of this demo only works if you're running this Notebook live, not if you're looking at a static copy on http://docs.mldb.ai. See the documentation for Running MLDB.


In [13]:
interact_manual(recommend)


m.title
_rowName
316 Stargate (1994)
1682 Truman Show, The (1998)
2617 Mummy, The (1999)
1222 Full Metal Jacket (1987)
2355 Bug's Life, A (1998)
6365 Matrix Reloaded, The (2003)
3527 Predator (1987)
2329 American History X (1998)
292 Outbreak (1995)
145 Bad Boys (1995)
4886 Monsters, Inc. (2001)
2000 Lethal Weapon (1987)
1784 As Good as It Gets (1997)
434 Cliffhanger (1993)
1961 Rain Man (1988)
349 Clear and Present Danger (1994)
2115 Indiana Jones and the Temple of Doom (1984)
5445 Minority Report (2002)
594 Snow White and the Seven Dwarfs (1937)
1641 Full Monty, The (1997)
2003 Gremlins (1984)

Where to next?

Check out the other Tutorials and Demos.


In [ ]: